Round 1: Technical
✅ Tell me about yourself and any recent projects you have been a part of, followed by questions related to those projects.
✅ Explain the role of AWS Glue Data Catalog in a Spark job. How do you use it with Spark on AWS? (Sketch 1 below)
✅ Write a Spark code snippet to read data from an S3 bucket (in CSV format), filter it based on some condition, and save the result back to S3. (Sketch 2 below)
✅ How do you handle data skew in Spark, especially when dealing with large datasets in AWS EMR? (Sketch 3 below)
✅ Explain how you would implement incremental data processing using AWS Glue and Spark. (Sketch 4 below)
✅ How would you handle schema evolution when using AWS Glue for ETL jobs on data stored in S3 or Redshift? (Sketch 5 below)
✅ Explain the difference between AWS Glue DynamicFrame and Spark DataFrame. When would you use each in a data engineering pipeline? (Sketch 6 below)
✅ Describe how you would handle data quality checks and validation in an AWS-based data pipeline using Spark. (Sketch 7 below)
✅ How would you architect a solution to process streaming data using AWS Kinesis and Spark Structured Streaming? (Sketch 8 below)
✅ How do you handle large-scale data processing with AWS Lambda and Spark? What challenges do you face? (Sketch 9 below)
✅ How do you ensure fault tolerance and resilience in a Spark job running on AWS EMR? (Sketch 10 below)
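
The sketches below are one way to answer the technical questions above. They are minimal PySpark illustrations, not authoritative solutions; every bucket, database, table, column, stream, and job name in them is hypothetical.

Sketch 1 (Glue Data Catalog with Spark): inside a Glue job, a catalog table is read through `create_dynamic_frame.from_catalog`; on EMR, configuring the catalog as the Hive metastore makes the same tables visible to plain Spark SQL.

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# DynamicFrame backed by a catalog entry; the schema comes from the Data
# Catalog. "sales_db" / "orders" are hypothetical crawler-registered names.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="orders",
)

# On EMR with the Glue Data Catalog configured as the Hive metastore,
# the same table is queryable with plain Spark SQL.
spark = glue_context.spark_session
spark.sql("SELECT COUNT(*) FROM sales_db.orders").show()
```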
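
Sketch 2 (read CSV from S3, filter, write back): a straightforward PySpark version; the `amount > 1000` condition stands in for whatever filter the interviewer asks for.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("s3-csv-filter").getOrCreate()

# Read CSV with a header row; paths and column names are hypothetical.
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://my-bucket/input/")
)

# Keep only rows matching the condition.
filtered = df.filter(F.col("amount") > 1000)

# Write back to S3 as CSV (Parquet would usually be preferred in practice).
filtered.write.mode("overwrite").option("header", "true").csv("s3://my-bucket/output/")
```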
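
Sketch 3 (data skew): on Spark 3.x / recent EMR releases, adaptive query execution can split skewed join partitions automatically; on older versions, manual key salting is the usual fallback.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Spark 3.x: let adaptive execution detect and split skewed join partitions.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Manual salting: spread a hot key ("user_id") over N synthetic sub-keys.
# "events" and "users" are hypothetical catalog tables.
SALT_BUCKETS = 16
events = spark.table("events").withColumn(
    "salted_key",
    F.concat_ws("_", F.col("user_id"),
                (F.rand() * SALT_BUCKETS).cast("int").cast("string")),
)
# Replicate the small side once per salt value so every sub-key finds a match.
salts = spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
users = (
    spark.table("users")
    .crossJoin(salts)
    .withColumn("salted_key",
                F.concat_ws("_", F.col("user_id"), F.col("salt").cast("string")))
)
joined = events.join(users, "salted_key")
```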
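
Sketch 4 (incremental processing): Glue job bookmarks track which input was already processed, so a scheduled job only reads new S3 objects on each run. This assumes the job was created with bookmarks enabled (`--job-bookmark-option job-bookmark-enable`).

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# transformation_ctx is the handle Glue uses to bookmark this source, so each
# run picks up only files not seen before. Names are hypothetical.
new_rows = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db",
    table_name="clickstream",
    transformation_ctx="clickstream_source",
)

glue_context.write_dynamic_frame.from_options(
    frame=new_rows,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/processed/"},
    format="parquet",
    transformation_ctx="clickstream_sink",
)

job.commit()  # persists the bookmark state for the next run
```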
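
Sketch 5 (schema evolution): crawlers keep the catalog schema current, DynamicFrames tolerate per-record drift, and `resolveChoice` pins a column whose type changed across files; plain Spark can also merge Parquet schemas at read time. The drifting `price` column is a hypothetical example.

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Hypothetical table whose "price" column was int in old files, string in new ones.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders"
)
# resolveChoice pins the ambiguous column to one type instead of failing the job.
orders = orders.resolveChoice(specs=[("price", "cast:double")])

# For plain Parquet on S3, Spark can merge schemas written at different times.
merged = spark.read.option("mergeSchema", "true").parquet("s3://my-bucket/orders/")
```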
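
Sketch 6 (DynamicFrame vs. DataFrame): DynamicFrames suit messy or evolving sources and Glue-native sinks, since they carry no fixed schema; DataFrames suit joins, SQL, and heavy transformations. Converting between them is cheap.

```python
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql import functions as F

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical catalog table.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders"
)

df = dyf.toDF()  # DynamicFrame -> DataFrame for Spark SQL-style work
df = df.withColumn("net", F.col("amount") - F.col("discount"))

# Back to a DynamicFrame for Glue-native sinks and transforms.
clean = DynamicFrame.fromDF(df, glue_context, "orders_clean")
```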
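
Sketch 7 (data quality checks): managed options exist (Deequ, AWS Glue Data Quality), but the idea can be shown with plain PySpark assertions that fail the job before bad data is published; the thresholds and column names here are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://my-bucket/processed/orders/")  # hypothetical path

# Basic checks before publishing downstream.
total = df.count()
null_ids = df.filter(F.col("order_id").isNull()).count()
dupes = total - df.dropDuplicates(["order_id"]).count()

# Fail fast (and loudly) rather than ship bad data; 1% is an arbitrary threshold.
if total == 0 or null_ids > 0 or dupes / total > 0.01:
    raise ValueError(
        f"Data quality check failed: rows={total}, null ids={null_ids}, dupes={dupes}"
    )
```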
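
Sketch 8 (Kinesis + Structured Streaming): this assumes a Kinesis source connector on the classpath (for example the spark-sql-kinesis connector available on EMR); option names vary slightly between connector versions, and the stream, schema, and paths are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("kinesis-stream").getOrCreate()

schema = StructType([
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
])

# Read from Kinesis; option names follow the spark-sql-kinesis connector
# and may differ slightly by version.
raw = (
    spark.readStream.format("kinesis")
    .option("streamName", "orders-stream")
    .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
    .option("startingPosition", "TRIM_HORIZON")
    .load()
)

# Kinesis records arrive as a binary "data" column; parse the payload as JSON.
events = (
    raw.select(F.from_json(F.col("data").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Land micro-batches on S3; the checkpoint makes the query restartable.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "s3://my-bucket/stream-out/")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/orders/")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```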
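
Sketch 9 (Lambda with Spark): Lambda itself is a poor host for Spark (15-minute limit, memory caps, no cluster), so a common pattern is to use it purely as a trigger that starts a Glue or EMR Spark job when new data lands in S3. The Glue job name is hypothetical.

```python
import boto3

glue = boto3.client("glue")

def handler(event, context):
    # Triggered by an S3 event; hand the heavy lifting to a Spark job in Glue.
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]

    run = glue.start_job_run(
        JobName="nightly-etl",  # hypothetical Glue job
        Arguments={"--input_path": f"s3://{bucket}/{key}"},
    )
    return run["JobRunId"]
```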
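
Sketch 10 (fault tolerance on EMR): beyond YARN restarting failed executors, a few Spark settings plus explicit checkpointing help a long job survive task failures, executor loss, and Spot interruptions. The values here are illustrative, not tuned recommendations.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("resilient-job")
    # Retry a failing task several times before failing the whole stage.
    .config("spark.task.maxFailures", "8")
    # Re-launch slow straggler tasks on healthy executors.
    .config("spark.speculation", "true")
    # Tolerate executor loss (e.g. Spot interruptions) before giving up.
    .config("spark.yarn.max.executor.failures", "32")
    .getOrCreate()
)

# Checkpoint long lineages to S3 so a lost partition is recomputed from the
# checkpoint instead of replaying the whole DAG; the path is hypothetical.
spark.sparkContext.setCheckpointDir("s3://my-bucket/checkpoints/")
```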
Round 2: HR
✅ Discussion around my experience and projects, along with some resume-based questions.
✅ Reason for leaving previous company.
✅ What are you expecting in your next job role?